Continually improving grounded natural language understanding through human-robot dialog
As robots become ubiquitous in homes and workplaces such as hospitals and factories, they must be able to communicate with humans. Several kinds of knowledge are required to understand and respond to a human's natural language commands and questions. If a person asks an assistant robot to "take me to Alice's office," the robot must know that Alice is a person who owns some unique office, and that "take me" means it should navigate there. Similarly, if a person requests "bring me the heavy, green mug," the robot must have accurate mental models of the physical concepts "heavy," "green," and "mug." To avoid forcing humans to use only key phrases or words robots already know, this thesis focuses on helping robots understand new language constructs through interactions with humans and with the world around them.

To understand a command in natural language, a robot must first convert that command to an internal representation that it can reason with. Semantic parsing is a method for performing this conversion, and the target is often a semantic form represented in predicate logic with lambda calculus. Traditional semantic parsing relies on hand-crafted resources from a human expert: an ontology of concepts, a lexicon connecting language to those concepts, and training examples pairing language with abstract meanings. One thrust of this thesis is to perform semantic parsing with sparse initial data. We use conversations between a robot and human users to induce pairs of natural language utterances and the target semantic forms the robot discovers through its questions, reducing the annotation effort of creating training examples for parsing. We use this data to build more dialog-capable robots in new domains with much less expert human effort (Thomason et al., 2015; Padmakumar et al., 2017).

Meanings of many language concepts are bound to the physical world. Understanding object properties and categories such as "heavy," "green," and "mug" requires interacting with and perceiving the physical world. Embodied robots can use manipulation capabilities, such as pushing, picking up, and dropping objects, to gather sensory data about them. This data can be used to understand non-visual concepts like "heavy" and "empty" (e.g., get the empty carton of milk from the fridge), and to assist with concepts that have both visual and non-visual expression (e.g., tall things look big and also exert force sooner than short things when pressed down on). A second thrust of this thesis focuses on strategies for learning these concepts using multi-modal sensory information. We use human-in-the-loop learning to obtain labels connecting concept words to actual objects in the environment (Thomason et al., 2016, 2017). We also explore ways to tease out polysemy and synonymy in concept words (Thomason and Mooney, 2017), such as "light," which can refer to a weight or a color, the latter sense being synonymous with "pale." Additionally, because pushing, picking up, and dropping objects to gather sensory information is prohibitively time-consuming, we investigate strategies for using linguistic information and human input to expedite exploration when learning a new concept (Thomason et al., 2018).

Finally, we build an integrated agent with both parsing and perception capabilities that learns from conversations with users to improve both components over time.
We demonstrate that parser learning from conversations (Thomason et al., 2015) can be combined with multi-modal perception (Thomason et al., 2016), using predicate-object labels gathered through opportunistic active learning (Thomason et al., 2017) during those conversations, to improve performance in understanding natural language commands from humans. Human users also qualitatively rate this integrated learning agent as more usable after it has improved through conversation-based learning.
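To make the target representation concrete, here is a minimal, purely illustrative sketch of semantic parsing into a lambda-calculus-style logical form. The lexicon, its entries, and the output format are hypothetical stand-ins, not the grammar or ontology used in the thesis.

```python
# Toy lexicon-driven semantic parse of a robot command into a
# lambda-calculus-style logical form. Illustrative only.

LEXICON = {
    "bring": "action:bring",
    "heavy": "pred:heavy",
    "green": "pred:green",
    "mug":   "pred:mug",
}

def parse(command: str) -> str:
    """Map a command to a form like bring(λx. heavy(x) ∧ green(x) ∧ mug(x))."""
    tokens = [t.strip(",.") for t in command.lower().split()]
    action = next(LEXICON[t].split(":")[1] for t in tokens
                  if LEXICON.get(t, "").startswith("action"))
    preds = [LEXICON[t].split(":")[1] for t in tokens
             if LEXICON.get(t, "").startswith("pred")]
    body = " ∧ ".join(f"{p}(x)" for p in preds)
    return f"{action}(λx. {body})"

print(parse("bring me the heavy, green mug"))
# -> bring(λx. heavy(x) ∧ green(x) ∧ mug(x))
```

A learned parser replaces this hand-written lexicon with entries induced from the robot's clarification dialogs, which is precisely the annotation effort the thesis aims to reduce.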
Chain-of-Questions Training with Latent Answers for Robust Multistep Question Answering
We train a language model (LM) to robustly answer multistep questions by generating and answering sub-questions. We propose Chain-of-Questions, a framework that trains a model to generate sub-questions and sub-answers one at a time by leveraging human-annotated question decomposition meaning representations (QDMR). The key technical challenge is that QDMR contains only sub-questions, not answers to those sub-questions, so we treat sub-answers as latent variables and optimize them using a novel dynamic mixture of Hard-EM and MAPO. Chain-of-Questions greatly outperforms strong neuro-symbolic methods by 9.0 F1 on the DROP contrast set, and outperforms GPT-3.5 by 24.3 F1 on the HotpotQA adversarial set, demonstrating the effectiveness and robustness of our framework.
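The Hard-EM side of this training can be sketched as follows: among candidate sub-answer assignments, pick the one the model scores best for producing the final answer, then train on that single candidate. The `model_loss` and `candidates` interfaces here are assumptions for illustration, and the MAPO component of the paper's dynamic mixture is omitted.

```python
# Hard-EM over latent sub-answers (sketch; MAPO mixture omitted).

def hard_em_step(model_loss, question, sub_questions, final_answer, candidates):
    # E-step: choose the latent sub-answer sequence the model
    # currently explains best (lowest loss on the final answer).
    best = min(candidates,
               key=lambda subs: model_loss(question, sub_questions,
                                           subs, final_answer))
    # M-step: return the loss under that assignment for a gradient update.
    return model_loss(question, sub_questions, best, final_answer)
```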
Generalization Differences between End-to-End and Neuro-Symbolic Vision-Language Reasoning Systems
For vision-and-language reasoning tasks, both fully connectionist, end-to-end methods and hybrid, neuro-symbolic methods have achieved high in-distribution performance. In which out-of-distribution settings does each paradigm excel? We investigate this question on both single-image and multi-image visual question answering through four types of generalization tests: a novel segment-combine test for multi-image queries, contrast sets, compositional generalization, and cross-benchmark transfer. End-to-end trained vision-and-language systems exhibit sizeable performance drops across all of these tests. Neuro-symbolic methods suffer even more on cross-benchmark transfer from GQA to VQA, but they show smaller accuracy drops on the other generalization tests, and their performance quickly improves with few-shot training. Overall, our results demonstrate the complementary benefits of these two paradigms and emphasize the importance of using a diverse suite of generalization tests to fully characterize model robustness to distribution shift.
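The evaluation protocol amounts to measuring the accuracy drop from an in-distribution set to each generalization test. A minimal sketch, assuming an `evaluate(model, example) -> bool` callable and test datasets keyed by name (both hypothetical interfaces):

```python
# Report per-test accuracy drops relative to in-distribution accuracy.

def accuracy(model, dataset, evaluate):
    return sum(evaluate(model, ex) for ex in dataset) / len(dataset)

def generalization_report(model, in_dist, tests, evaluate):
    """tests: dict mapping a test name (e.g. 'contrast_set',
    'cross_benchmark') to its dataset."""
    base = accuracy(model, in_dist, evaluate)
    return {name: base - accuracy(model, ds, evaluate)
            for name, ds in tests.items()}  # positive value = drop
```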
Improving Sign Recognition with Phonology
We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work, which does not consider sign language phonology. We train ISLR models that take in pose estimations of a signer producing a single sign and predict not only the sign but also its phonological characteristics, such as the handshape. These auxiliary predictions lead to a nearly 9% absolute gain in sign recognition accuracy on the WLASL benchmark, with consistent improvements in ISLR regardless of the underlying prediction model architecture. This work has the potential to accelerate linguistic research in the domain of signed languages and reduce communication barriers between deaf and hearing people.
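The auxiliary-prediction idea can be sketched as a multi-task model: a shared encoder over pose sequences with a main sign head and an auxiliary phonology head whose losses are summed. The architecture, dimensions, class counts, and loss weighting below are illustrative assumptions, not the paper's actual model.

```python
# Multi-task ISLR sketch: shared pose encoder, sign head plus an
# auxiliary handshape (phonology) head. All hyperparameters assumed.
import torch
import torch.nn as nn

class PhonologyAwareISLR(nn.Module):
    def __init__(self, pose_dim=150, hidden=256,
                 num_signs=2000, num_handshapes=50):
        super().__init__()
        self.encoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.sign_head = nn.Linear(hidden, num_signs)
        self.handshape_head = nn.Linear(hidden, num_handshapes)

    def forward(self, poses):                 # poses: (batch, time, pose_dim)
        _, h = self.encoder(poses)
        h = h[-1]                             # final hidden state
        return self.sign_head(h), self.handshape_head(h)

def loss_fn(sign_logits, hs_logits, sign_y, hs_y, aux_weight=0.5):
    ce = nn.functional.cross_entropy
    # Auxiliary phonological loss is added to the main sign loss.
    return ce(sign_logits, sign_y) + aux_weight * ce(hs_logits, hs_y)
```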
Do Localization Methods Actually Localize Memorized Data in LLMs?
Large language models (LLMs) can memorize many pretrained sequences verbatim. This paper studies whether we can locate a small set of neurons in LLMs responsible for memorizing a given sequence. While the concept of localization is often mentioned in prior work, localization methods have never been systematically and directly evaluated; we address this with two benchmarking approaches. In our INJ Benchmark, we actively inject a piece of new information into a small subset of LLM weights and measure whether localization methods can identify these "ground truth" weights. In the DEL Benchmark, we study localization of pretrained data that LLMs have already memorized; while this setting lacks ground truth, we can still evaluate localization by measuring whether dropping out the located neurons erases a memorized sequence from the model. We evaluate five localization methods on our two benchmarks, and both show similar rankings. All methods exhibit promising localization ability, with pruning-based methods performing especially well, though the neurons they identify are not necessarily specific to a single memorized sequence.
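A DEL-style check can be sketched as: zero out the located neurons, then test whether the memorized continuation survives greedy decoding. The sketch below assumes a Hugging Face GPT-2-style model, where `mlp.c_proj.weight` is laid out (intermediate, hidden), and a hypothetical `located` list of (layer, neuron_index) pairs; it illustrates the evaluation idea, not the paper's code.

```python
# DEL-style evaluation sketch: drop located neurons, test memorization.
import torch

@torch.no_grad()
def still_memorized(model, tokenizer, prompt, memorized, located):
    for layer, idx in located:
        # Silence neuron idx by zeroing its output-projection row.
        model.transformer.h[layer].mlp.c_proj.weight[idx] = 0.0
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    mem_ids = tokenizer(memorized, return_tensors="pt").input_ids
    out = model.generate(prompt_ids, max_new_tokens=mem_ids.shape[1],
                         do_sample=False)
    continuation = tokenizer.decode(out[0, prompt_ids.shape[1]:],
                                    skip_special_tokens=True)
    # Localization "succeeds" on DEL if the memorized text is erased.
    return continuation.strip() == memorized.strip()
```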
Geolocated Social Media Posts are Happier: Understanding the Characteristics of Check-in Posts on Twitter
The increasing prevalence of location-sharing features on social media has enabled researchers to ground computational social science research using geolocated data, affording opportunities to study human mobility, the impact of real-world events, and more. This paper analyzes what crucially separates posts with geotags from those without. We find that users who share location are not representative of the social media user population at large, jeopardizing the generalizability of research that uses only geolocated data. We consider three aspects: affect (sentiment and emotions), content (textual and non-textual), and audience engagement. By comparing a dataset of 1.3 million geotagged tweets with a random dataset of the same size, we show that geotagged posts on Twitter exhibit significantly more positivity; are often about joyous and special events such as weddings or graduations; convey more collectivism rather than individualism; and contain more additional features such as hashtags or objects in images, but at the same time generate substantially less engagement. These findings suggest that significant differences exist in the messages conveyed in geotagged posts. Our research carries important implications for future research utilizing geolocated social media data.
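The core comparison can be sketched as scoring affect on the geotagged sample and the equal-size random sample, then testing the difference. The `sentiment` scorer and the input lists of tweet texts are hypothetical stand-ins for whatever tools and data the study actually used.

```python
# Sketch: compare affect between geotagged and random tweet samples.
from statistics import mean
from scipy.stats import mannwhitneyu

def compare_affect(geotagged, random_sample, sentiment):
    geo_scores = [sentiment(t) for t in geotagged]
    rnd_scores = [sentiment(t) for t in random_sample]
    # One-sided test of whether geotagged posts score more positive.
    stat, p = mannwhitneyu(geo_scores, rnd_scores, alternative="greater")
    return mean(geo_scores), mean(rnd_scores), p
```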